WEBVTT

00:00.750 --> 00:03.800
Hello and welcome back to our course about summary statistics.

00:03.800 --> 00:06.060
So why are summary statistics useful?

00:06.090 --> 00:13.170
Let's assume you have an array with 1,000 or 10,000 elements and you want to summarize these elements

00:13.200 --> 00:15.080
in one or two numbers.

00:15.230 --> 00:21.790
And for this task NumPy provides us with many options for summarizing large numerical arrays.

00:21.840 --> 00:24.950
So first of all, let's import NumPy.

00:24.990 --> 00:32.770
Then we create eleven random integers in the range one to one hundred and assign them to the variable a.

00:34.080 --> 00:43.560
Then we sort it and we can see, so now we have the lowest numbers, eighteen, forty-eight, and the highest

00:43.570 --> 00:45.230
integer, ninety-nine.

00:45.510 --> 00:48.480
And then we can calculate the maximum of our array.

00:48.510 --> 00:54.560
So the highest element, with the method .max, it's ninety-nine here.

00:55.650 --> 01:00.180
And we can also do the same here by applying the function np.max.

01:00.180 --> 01:03.650
So this gives us the same result.

01:04.020 --> 01:06.230
And earlier we learned about the function max.

01:06.240 --> 01:08.400
So we can also apply the function max.

01:09.600 --> 01:10.770
So we can see there are

01:10.840 --> 01:16.710
actually three alternatives to calculate the maximum, and also for the minimum there are the same

01:16.710 --> 01:17.720
three alternatives.
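
NOTE
A minimal sketch of the three alternatives described above.
Assumptions: the seed and the call to np.random.randint are illustrative
and are not taken from the video, so the printed numbers will differ.
```python
import numpy as np
# Assumption: seed 0 is arbitrary and only makes the example repeatable.
np.random.seed(0)
a = np.random.randint(1, 101, 11)  # eleven integers from 1 to 100 inclusive
a.sort()  # sort in place so the lowest and highest values are easy to read
max_method = a.max()      # array method
max_function = np.max(a)  # NumPy function
max_builtin = max(a)      # Python's built-in max, iterating over the array
print(a, max_method, max_function, max_builtin)
```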

01:17.750 --> 01:17.970
Um.

01:18.020 --> 01:20.540
And I only show here the NumPy method,

01:20.590 --> 01:31.170
.min, so I would expect 18. And we can also calculate the mean of all elements with np.mean.

01:32.570 --> 01:39.380
So it's seventy-four. Or we use the method .mean, it should be the same.

01:39.810 --> 01:42.150
And we can also calculate the median.

01:43.170 --> 01:45.000
So let's run it.

01:45.230 --> 01:45.800
Eighty four.

01:45.810 --> 01:50.970
So what is the median? So the median is, let's say,

01:51.020 --> 01:57.620
actually the point where 50 percent of all elements are lower and 50 percent of all elements

01:57.620 --> 01:58.910
are higher than the median.

01:58.910 --> 02:04.200
So we have eleven elements; in sorted order, the median is the sixth element.

02:04.200 --> 02:09.060
So we have five elements which are lower than the median and five elements which are higher than the

02:09.060 --> 02:09.870
median.
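
NOTE
A small sketch of the median explanation above: with eleven values,
the median is the sixth value in sorted order, with five values on
each side. The array is hand-picked for illustration and is not the
array from the video.
```python
import numpy as np
# Illustrative values chosen by hand; they are not the numbers from the video.
a = np.array([7, 93, 41, 58, 12, 76, 25, 88, 64, 30, 99])
a_sorted = np.sort(a)
middle = a_sorted[5]  # sixth element (index 5) of eleven sorted values
median = np.median(a)
lower = int((a < median).sum())   # number of elements below the median
higher = int((a > median).sum())  # number of elements above the median
print(middle, median, lower, higher)
```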

02:09.870 --> 02:12.020
So this is, in very simple words

02:12.030 --> 02:20.050
and in this simple example, the median. And we can also calculate the standard deviation of our eleven

02:20.070 --> 02:21.430
elements.

02:21.540 --> 02:22.980
So it gives us 24.

02:23.790 --> 02:28.420
So the standard deviation is a measure of the variability of our elements.

02:28.920 --> 02:35.510
So if all elements were, let's say, 50, our standard deviation would be zero.

02:35.520 --> 02:40.470
If the variability is quite high, then the standard deviation is also high, but I don't want to go

02:40.470 --> 02:46.270
into statistical details here. And the variance is also a measure of the variability.

02:46.320 --> 02:50.360
And actually standard deviation is the square root of variance.

02:50.370 --> 02:53.260
But this is not a course about statistics actually.
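
NOTE
A short sketch of the two statements above: a constant array has a
standard deviation of zero, and the standard deviation is the square
root of the variance. The arrays are hand-picked for illustration and
are not the data from the video.
```python
import numpy as np
# Illustrative values chosen by hand; they are not the numbers from the video.
a = np.array([7, 93, 41, 58, 12, 76, 25, 88, 64, 30, 99])
constant = np.full(11, 50)  # all elements equal, so there is no variability
std_a = np.std(a)  # population standard deviation (NumPy's default, ddof=0)
var_a = np.var(a)  # population variance (ddof=0)
print(np.std(constant))                   # 0.0
print(np.isclose(std_a, np.sqrt(var_a)))  # True
```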

02:53.370 --> 02:56.910
And we can also calculate a percentile of our array.

02:56.920 --> 03:01.540
So let's use here the function np.percentile.

03:01.650 --> 03:09.780
We pass our array and then we say we want to determine the 10th percentile, and it's forty-eight.

03:09.780 --> 03:10.420
So what.

03:10.440 --> 03:12.120
What does it mean exactly.

03:12.120 --> 03:19.350
So the 10th percentile means that this is the point actually where 10 percent of all the other elements

03:19.350 --> 03:22.350
are lower than this point and 90 percent are higher.

03:22.350 --> 03:26.660
So you can see here we have the 10th percentile forty eight.

03:26.730 --> 03:31.110
And, well, one element is lower and the other nine elements are higher.

03:31.170 --> 03:32.780
So this is, in easy words,

03:32.790 --> 03:44.060
the percentile. And also we can calculate the 90th percentile, so it's 98, and you can see here 90

03:44.070 --> 03:47.150
percent of our elements are lower than the 90th percentile.

03:47.160 --> 03:54.210
So we have nine elements, and only 10 percent, or in this case one element, is higher than the 90th

03:54.210 --> 03:58.920
percentile.
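
NOTE
A sketch of the percentile explanation above. With eleven values and
np.percentile's default linear interpolation, the 10th and 90th
percentiles fall on the second and tenth sorted values. The array is
hand-picked for illustration and is not the array from the video.
```python
import numpy as np
# Illustrative values chosen by hand; they are not the numbers from the video.
a = np.array([7, 93, 41, 58, 12, 76, 25, 88, 64, 30, 99])
a_sorted = np.sort(a)
p10 = np.percentile(a, 10)  # default method interpolates linearly between sorted values
p90 = np.percentile(a, 90)
# Compare each percentile with the sorted value it is expected to coincide with.
print(p10, a_sorted[1])
print(p90, a_sorted[9])
```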

03:58.980 --> 04:05.550
All right, so let's again create an array with eleven random integers in the range between 1 and 100

04:05.580 --> 04:11.080
and let's call it a, and we create another array with

04:11.100 --> 04:15.770
eleven random integers in the range 1 to 101, but you can see here

04:15.840 --> 04:22.080
we have a different random seed, and I would expect now that we get different random numbers here

04:22.080 --> 04:26.380
for our array b. You can see it here.

04:26.400 --> 04:28.180
So the arrays are different.

04:28.320 --> 04:33.810
Now we can calculate the covariance of a and b with the function np.cov,

04:37.470 --> 04:42.880
and the covariance is actually a measure of how two arrays, two sequences of numbers, move together.

04:43.360 --> 04:45.010
So here it gives us a matrix.

04:45.040 --> 04:51.450
The first element is actually the variance of the elements of NumPy array a.

04:52.330 --> 04:56.600
Then here we have the variance of our array B.

04:56.650 --> 04:58.170
And here we have, two times,

04:58.180 --> 05:00.770
the covariance of a and b.
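
NOTE
A sketch of the covariance matrix layout described above: variances
on the diagonal, and the covariance of a and b appearing twice off the
diagonal. The arrays are hand-picked for illustration and are not the
data from the video. One detail worth knowing: np.cov uses the sample
convention (ddof=1) by default, while np.var defaults to ddof=0.
```python
import numpy as np
# Illustrative values chosen by hand; they are not the numbers from the video.
a = np.array([7, 93, 41, 58, 12, 76, 25, 88, 64, 30, 99])
b = np.array([81, 15, 60, 44, 97, 23, 70, 9, 35, 66, 2])
c = np.cov(a, b)  # 2x2 matrix; sample convention (ddof=1) by default
print(c)
# Diagonal: variances of a and b. Off-diagonal: covariance of a and b, appearing twice.
print(np.isclose(c[0, 0], np.var(a, ddof=1)))  # True
print(np.isclose(c[1, 1], np.var(b, ddof=1)))  # True
print(np.isclose(c[0, 1], c[1, 0]))            # True
```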

05:00.910 --> 05:08.110
So as you can see, here is the minus sign, so both sequences should be kind of negatively correlated, and

05:08.350 --> 05:16.420
we can see this better by calculating the correlation coefficient with the function

05:16.420 --> 05:20.130
np.corrcoef. So also here we get a matrix.

05:20.140 --> 05:24.870
This is the correlation of array a with itself.

05:24.940 --> 05:25.860
It's 1 now.

05:26.080 --> 05:32.560
So it's no surprise that a is perfectly correlated with a. Here you get the correlation of array b with

05:32.650 --> 05:33.990
itself.

05:34.060 --> 05:35.860
And here we have, two times,

05:35.950 --> 05:39.350
the correlation of a and b, and it's negative.

05:39.370 --> 05:46.330
So whenever a increases, b should have a tendency to decrease. And the correlation coefficients

05:46.330 --> 05:53.650
lie between minus 1 and 1, and minus 1 means a perfect negative correlation and one means a perfect

05:53.650 --> 05:58.850
positive correlation, and a correlation of zero means actually no correlation.
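
NOTE
A sketch of the correlation matrix described above: ones on the
diagonal and the correlation of a and b off the diagonal, which equals
the covariance divided by the product of the standard deviations. The
arrays are hand-picked for illustration and are not the data from the
video.
```python
import numpy as np
# Illustrative values chosen by hand; they are not the numbers from the video.
a = np.array([7, 93, 41, 58, 12, 76, 25, 88, 64, 30, 99])
b = np.array([81, 15, 60, 44, 97, 23, 70, 9, 35, 66, 2])
r = np.corrcoef(a, b)  # 2x2 correlation matrix
print(r)
# Rebuild the off-diagonal entry from the covariance matrix as a cross-check.
c = np.cov(a, b)
manual = c[0, 1] / np.sqrt(c[0, 0] * c[1, 1])
print(np.isclose(r[0, 1], manual))  # True
```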

05:58.870 --> 06:05.830
So here you can see we have a slight negative correlation between NumPy arrays a and b, and actually because we

06:05.830 --> 06:12.010
created both arrays independently and randomly, we would expect actually to have zero correlation

06:12.010 --> 06:13.230
between those two.

06:13.330 --> 06:19.050
And if you would increase the number of elements in each array, so let's say instead of having eleven,

06:19.110 --> 06:25.700
one thousand elements then I would assume that we get a correlation which is very close to zero.
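
NOTE
A sketch of the expectation stated above: for independently drawn
arrays, the correlation tends to move toward zero as the arrays get
longer. Assumptions: the generator (np.random.default_rng), the seed
and the sample sizes are illustrative and are not taken from the video.
```python
import numpy as np
# Assumption: seed 42 is arbitrary and only makes the example repeatable.
rng = np.random.default_rng(42)
for n in (11, 1000, 100000):
    a = rng.integers(1, 101, n)  # independent draws from 1 to 100 inclusive
    b = rng.integers(1, 101, n)
    r = np.corrcoef(a, b)[0, 1]  # correlation between the two arrays
    print(n, r)
```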

06:25.720 --> 06:26.920
All right so this is it.

06:26.950 --> 06:29.800
This was a quick overview of summary statistics.

06:29.810 --> 06:35.590
So there are a lot of options here in NumPy, and in the next session we deal again with correlation

06:35.620 --> 06:37.450
and regression with NumPy.

06:37.510 --> 06:38.320
See you there.
